NLTK TwitterMixer

by Afton Wilky

The NLTK TwitterMixer generates new text based on tweets scraped from the Twitter API.

Twitter data is collected with a simple search query.

Writing functions scrape part-of-speech patterns from sentences and compile the words of the text into a Python dictionary keyed by part of speech. They create new sentences by looping through a part-of-speech pattern and making a random choice from the dictionary for each tag, as in the sketch below.
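
A minimal toy sketch of that remix step (the pattern and word lists here are made-up placeholders, not scraped data):

import random

pattern = ['DT', 'NN', 'VBZ']  # part-of-speech pattern scraped from one sentence
pos_dict = {'DT': ['the', 'a'], 'NN': ['bird', 'sky'], 'VBZ': ['sings', 'waits']}

# one random word per tag, in the pattern's order
print(' '.join(random.choice(pos_dict[pos]) for pos in pattern))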

Words used in the new text can be restricted to only those that contain a specified list of phonemes (i.e. particular sounds, such as the consonant 'R' or the vowel 'O'). Because the functions use the CMU Pronouncing Dictionary rather than a regex search, differences in spelling don't affect the results (e.g. 'ER' will return both hurt, HH ER T, and heard, HH ER D).
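
For example, looking both words up in the CMU Pronouncing Dictionary through NLTK (the stress digits on the vowel phones are part of the dictionary entries):

import nltk

prondict = nltk.corpus.cmudict.dict()
print(prondict['hurt'][0])   # ['HH', 'ER1', 'T']
print(prondict['heard'][0])  # ['HH', 'ER1', 'D']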

Packages required:

Natural Language Toolkit (NLTK), including the CMU Pronouncing Dictionary corpus (cmudict), installed on your local machine (see installation / getting started instructions at http://www.nltk.org/install.html). If you are using Anaconda or Miniconda, the NLTK package for Anaconda/Miniconda must also be installed. See the download sketch after this list.

Tweepy
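
After installing the packages, the NLTK corpora and models used below can be fetched with NLTK's downloader. A minimal sketch:

import nltk

nltk.download('cmudict')                      # CMU Pronouncing Dictionary
nltk.download('punkt')                        # sentence / word tokenizer models
nltk.download('averaged_perceptron_tagger')   # tagger used by nltk.pos_tag
nltk.download('wordnet')                      # wordnet corpus imported below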


In [1]:
# NLTK TwitterMixer, Copyright (C) 2017 Afton Wilky
# Author: Afton Wilky <aftonwilky.com>
# License: MIT

Twitter

Text remixed by the TwitterMixer is scraped from the Twitter API using the Tweepy (3.5.0) module. Tweepy Documentation: http://docs.tweepy.org/en/v3.5.0/

Twitter Import Statements


In [2]:
import json
import tweepy

Access Twitter API: keys and tokens

Use the Tweepy module to get access to the Twitter API. Store the consumer key and secret, and the access token and secret, in a local file called 'twconfig.py'.

Instructions

Sign up for or log into a Twitter Developer account: https://dev.twitter.com/

Navigate to or create an app to get OAuth credentials, tokens and keys: https://apps.twitter.com/

Access tokens and keys are available under "Manage Keys and Access Tokens."

Create a local file 'twconfig.py'. Assign the values from your Twitter app to the following variables:

consumer_key,

consumer_secret,

access_token,

access_secret

Save the local file and import it.
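
A minimal twconfig.py looks like this (placeholder values; substitute your own credentials and keep the file private):

# twconfig.py
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'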


In [3]:
# Import local file that stores tokens and keys

from twconfig import *

Tweepy documentation on using your tokens and keys: http://docs.tweepy.org/en/v3.5.0/auth_tutorial.html#auth-tutorial


In [4]:
# Get Twitter API Access

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

Query Twitter API and Process the Data

The following are very basic functions to query the Twitter API and store the results in variables and .txt files, using plain-text / string and Python-dict formats so the data is easy to work with.

Currently, Tweepy does not support limiting searches by date, or limiting the number of results unless they are paginated.

An alternative to Tweepy is the Twython module, which is more full-featured. However, because of its dependencies (requests-mock), that package can no longer be used with Anaconda / Miniconda.


In [5]:
#######################
#  Query Twitter API  #
#######################

def twitter_search(query):
    """Query the Twitter API"""
    tweets = []
    for tweet in api.search(q=query):
        tweets.append(tweet._json)
    return tweets
    
    
##################
#  Process Data  #
##################

def process_tweets(data):
    """
    Process the results of a Twitter API query.
    Builds a dictionary with 'text', 'created_at' and
    'hashtags' fields for each tweet.
    """
    # get the text and date fields
    processed_data = []
    for tweet in data:
        entry = {}
        entry['text'], entry['created_at'] = tweet['text'], tweet['created_at']
        if tweet['entities']['hashtags'] != []:
            hashtags = []
            for hashtag in tweet['entities']['hashtags']:
                hashtags.append(hashtag['text'])
            entry['hashtags'] = hashtags
        else:
            entry['hashtags'] = []
        processed_data.append(entry)
    return processed_data


def get_string(tweets):
    """
    Converts unprocessed Twitter data into a string.
    """
    text = ''
    for tweet in process_tweets(tweets):
        text = text + tweet['text'] + '\n'
    return text   


##########################
#  Write Data to Files   #
##########################

def write_file_tweets(processed_data, filepath, filename):
    with open(filepath + filename + '.txt', 'wb') as f:
        for tweet in processed_data:
            f.write(bytes(tweet['text'] + '\n', 'utf-8'))


def write_file_hashtags(processed_data, filepath, filename):
    with open(filepath + filename + '.txt', 'wb') as f:
        for tweet in processed_data:
            for hashtag in tweet['hashtags']:
                f.write(bytes(hashtag + '\n', 'utf-8'))

Call functions to create variables and files storing raw and processed Twitter data.


In [6]:
########################
#  Query Twitter API   #
########################

hello_raw = twitter_search('hello')


##############################################################
#  Convert raw Twitter query data into a Python dictionary.  #
##############################################################

hello = process_tweets(hello_raw)


###################################################
#  Convert raw Twitter query data into a string   #
#  that can be used by NLTK.                      #
###################################################

hello_text = get_string(hello_raw)


######################################################
#  Write processed Twitter hashtag data to a file    #                 
#  Note: replace argument #2 with the appropriate    #
#  filepath for your computer                        #
######################################################

# write_file_hashtags(hello, '/Users/USERNAME/', 'testing')


######################################################
#  Write processed Twitter tweets data to a file     #
#  Note: replace argument #2 with the appropriate    #
#  filepath for your computer                        #
######################################################

# write_file_tweets(hello, '/Users/USERNAME/', 'testing')

NLTK

The Natural Language Toolkit (NLTK) provides access to a series of corpora and functions useful for parsing and working with text. Information about the toolkit and its documentation are available at http://www.nltk.org/.

NLTK functions require plaintext (.txt) files, or variables assigned to values processed by NLTK's PlaintextCorpusReader.

e.g.

tweets_all = PlaintextCorpusReader(FILEPATH, 'tweets_all.txt')

hashtags_all = PlaintextCorpusReader(FILEPATH, 'hashtags_all.txt')

NLTK Import Statements


In [7]:
import nltk
import nltk.data


import random
from collections import defaultdict, OrderedDict


from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import CategorizedPlaintextCorpusReader
from nltk.corpus import cmudict
from nltk.corpus import wordnet


from nltk import load_parser
from nltk.tokenize import *
from nltk.probability import *
from nltk.misc.wordfinder import wordfinder
from nltk.text import Text

Variables

Set a variable to access the CMU Pronouncing Dictionary.


In [8]:
prondict = nltk.corpus.cmudict.dict()

Basic Functions


In [9]:
def write_file(x, filepath):
    """Writes x to a .txt file whose name is generated from its first 20 characters."""
    bad_file_chars = ['>', '<', ':', '"', '/', '\\', '|', '*', ' ', '?', '\u2014', '\u2019']
    filename = str(x[:20])
    for char in bad_file_chars:
        filename = filename.replace(char, '_')
    # write to filepath + <sanitized first 20 characters> + '.txt'
    with open(filepath + filename + '.txt', 'wb') as f:
        f.write(x.encode('utf-8'))


def write_json_file(data, filepath):
    with open(filepath + '.json', 'w') as f:
        json.dump(data, f)


def read_file(filepath, filename):
    """Reads file"""
    return PlaintextCorpusReader(filepath, filename)


def tokenize(text):
    sentences = nltk.sent_tokenize(text.raw())
    words = [nltk.word_tokenize(sentence.lower()) for sentence in sentences]
    return words


def process(text):
    words = tokenize(text)
    tagged_words = [dict(nltk.pos_tag(word)) for word in words]
    return tagged_words
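
To see the structure process produces, here is a one-sentence illustration on toy input (note that dict(nltk.pos_tag(...)) keeps only one tag per distinct word in a sentence):

import nltk

tokens = nltk.word_tokenize('the bird sings')
print(dict(nltk.pos_tag(tokens)))  # e.g. {'the': 'DT', 'bird': 'NN', 'sings': 'VBZ'}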

Basic NLTK Functions


In [10]:
def get_words_longer_than(length, text):
    """Returns words from a text that are longer than a given length."""
    longer_words = set([w for w in text.words() if len(w) > length])
    return longer_words


def get_words_by_char(text, ch):
    """Returns words in a text that contain specified character or string."""
    ch_words = [w.lower() for w in text.words() if ch in w]
    return ch_words

CMU Pronouncing Dictionary Functions

The CMU Pronouncing Dictionary is a corpus of English words and their pronunciations, available through the Natural Language Toolkit.

Pronunciations are broken into a series of one- and two-letter 'phones', each representing a single sound. Vowel phones also carry a digit marking syllable stress (0: no stress; 1: primary stress; 2: secondary stress).

Information about the dictionary is available at http://www.speech.cs.cmu.edu/cgi-bin/cmudict
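
For example, 'hello' has two entries in the dictionary, each a list of phones with stress digits on the vowels:

import nltk

prondict = nltk.corpus.cmudict.dict()
print(prondict['hello'])
# [['HH', 'AH0', 'L', 'OW1'], ['HH', 'EH0', 'L', 'OW1']]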


In [11]:
######################################################
#  Helper Functions - CMU Pronunciation Dictionary   #
######################################################

def get_safe_cmudict_list(text):
    """
    Returns a LIST of words in a text that can be 
    processed by the CMU Pronunciation Dictionary.
    """
    # get_cmudict_error_words(text)
    cmu_dict_words = sorted([w.lower() for w in text.words() 
                             if w.lower() in prondict])
    return cmu_dict_words


def get_safe_cmudict_set(text):
    """
    Returns a SET of words in a text that can be 
    processed by the CMU Pronunciation Dictionary.
    """
    return set(get_safe_cmudict_list(text))

Phone-based Functions (CMU Pronouncing Dictionary)

For information about 'phones', see the "Phoneme Set" section of the CMU Dict information page: http://www.speech.cs.cmu.edu/cgi-bin/cmudict


In [12]:
def get_phones(text):
    """Returns a list of CMU Dictionary phones in a text."""
    cmu_dict_words = get_safe_cmudict_list(text)
    return sorted([ph for w in cmu_dict_words for ph in prondict[w][0]])


def get_phones_by_freq(text):
    """Returns (phone, count) pairs for a text, sorted by descending frequency."""
    # FreqDist comes from the nltk.probability import above
    return FreqDist(get_phones(text)).most_common()


def get_words_by_phone(phone, text):
    """Returns a list of words in a text that contain a specified phone."""
    cmu_dict_words = get_safe_cmudict_set(text)
    return sorted(set([w.lower() for w in cmu_dict_words for 
                       ph in prondict[w][0] if phone[:2] in ph]))


def get_phone_words(phone_list, text):
    return set([word for phone in phone_list for 
                word in get_words_by_phone(phone, text)])
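
As a quick check of the phone-matching idea on a toy word list (mirroring get_words_by_phone, but without a corpus-reader text object):

import nltk

prondict = nltk.corpus.cmudict.dict()
words = ['hurt', 'heard', 'hello', 'world']
print(sorted(w for w in words if any('ER' in ph for ph in prondict[w][0])))
# ['heard', 'hurt', 'world']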

Lexical / Grammatical Functions


In [13]:
def get_grammar(text):
    """return a list containing the part of speech order of a sentences in a text"""
    processed_text = process(text)
    grammar = OrderedDict()
    for i in range(0, len(processed_text)):
        grammar[i] = []
        for k, v in processed_text[i].items():
            # print grammar[i]
            grammar[i].append(v)
    return grammar

Dictionary Functions


In [14]:
def make_pos_dict(text):
    """Returns a dictionary of words sorted by their part of speech."""
    tagged_words = process(text)
    pos = defaultdict(set)
    for i in range(0, len(tagged_words)):
        for k, v in tagged_words[i].items():
            pos[v].add(k)
    return pos


def make_word_pos_dict(text):
    """Returns a flat dict of words and parts of speach in text."""
    tagged_words = process(text)
    word_pos_dict = {}
    for i in range(0, len(tagged_words)):
        for word, pos in tagged_words[i].items():
            word_pos_dict[word] = pos
    return word_pos_dict


def make_phone_pos_dict(phone_list, text):
    """
    Returns a dictionary of words containing the specified phones, 
    sorted by their part of speech
    """
    pos = make_pos_dict(text)
    phone_words = get_phone_words(phone_list, text)
    phone_pos_dict = defaultdict(list)
    for k, v in pos.items():
        for word in v:
            if word in phone_words:
                phone_pos_dict[k].append(word)
    return phone_pos_dict
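
A toy illustration of the structure make_pos_dict builds, where each tag maps to the set of words seen with it (hypothetical one-sentence input):

import nltk
from collections import defaultdict

pos = defaultdict(set)
for word, tag in nltk.pos_tag(nltk.word_tokenize('the bird sings the song')):
    pos[tag].add(word)
print(dict(pos))  # e.g. {'DT': {'the'}, 'NN': {'bird', 'song'}, 'VBZ': {'sings'}}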

Writing Functions


In [15]:
def write_sentences(text):
    """
    Writes sentences using the part of speech patterns and words 
    from the input text.
    """
    # scrape part of speech patterns from input text
    grammar = get_grammar(text)
    # compile a dictionary, associating each word with a part of speech
    pos_dict = make_pos_dict(text)
    new_text = []
    # go through each sentence in grammar
    for i in range(0, len(grammar)): 
        # for each pos in grammar[i] make a random choice from the pos_dict
        for pos in grammar[i]:
            word = random.choice(list(pos_dict[pos]))
            # append that choice to new_text
            new_text.append(word)
    return ' '.join(new_text)


def get_new_text(word_dict, grammar):
    """
    Writes new text based on an input grammar and dictionary of words 
    associated with their part of speech.
    """
    new_text = []
    # go through each sentence in grammar
    for i in range(0, len(grammar)): 
        # for each pos in grammar[i] make a random choice from the phone_pos_dict
        for pos in grammar[i]:
            if word_dict[pos]:
                word = random.choice(list(word_dict[pos]))
                # append that choice to new_text
                new_text.append(word)
    return ' '.join(new_text)


def write_phone_sentences(phone_list, text):
    """
    Writes sentences using the part of speech patterns and words 
    from the input text that contain specified phones.
    """
    grammar = get_grammar(text)
    phone_pos_dict = make_phone_pos_dict(phone_list, text)
    return get_new_text(phone_pos_dict, grammar)

Examples

Read text from a file, write new text based on it, and save the result to a file.


In [16]:
##############################
#  Read text from .txt file  #
##############################

# new_file = read_file('FILEPATH', 'FILENAME.txt')

###  OR  ###

# tweets = PlaintextCorpusReader('FILEPATH', 'FILENAME.txt')


################################################################
# Write sentences based on the file read in the previous step  #
################################################################

# Write new sentences based on words and grammar in a text
# new_text = write_sentences(new_file)

###  OR  ###

# Write new sentences with specified phones
# based on words and grammar in a text
# new_text = write_phone_sentences(['OW', 'OY', 'UW', 'AO'], new_file)


#############################################
#  Create a new file containing the         #
#  sentences written in the previous step   #
#############################################

# write_file(new_text, 'FILEPATH')
# print(new_text)